CS3600: Deep Learning

Course project

Reichman University

Submitted by:

#           Name           Id          Email
Student 1   Gil Zeevi      203909320   gil.zeevi@post.idc.ac.il
Student 2   Joel Liurner   346243579   joel.liurner@post.idc.ac.il

Introduction

In this assignment you will:

  1. learn to generate images and implement two different generative models:
    • Variational autoencoder
    • Generative adversarial network
  2. answer course summary questions
  3. implement a mini project

General Guidelines

Good luck with the project and with your exams!

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$

Part 2: Variational Autoencoder

In this part we will learn to generate new data using a special type of autoencoder model which allows us to sample from its latent space. We'll implement and train a VAE and use it to generate new images.

Obtaining the dataset

Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.

We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)

However, if you feel adventurous and/or prefer to generate something else, feel free to edit the PART2_CUSTOM_DATA_URL variable in hw4/answers.py.

Create a Dataset object that will load the extracted images:

OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.

The Variational Autoencoder

An autoencoder is a model which learns a representation of data in an unsupervised fashion (i.e. without any labels). Recall its general form from the lecture:

An autoencoder maps an instance $\bb{x}$ to a latent-space representation $\bb{z}$. It has an encoder part, $\Phi_{\bb{\alpha}}(\bb{x})$ (a model with parameters $\bb{\alpha}$) and a decoder part, $\Psi_{\bb{\beta}}(\bb{z})$ (a model with parameters $\bb{\beta}$).

While autoencoders can learn useful representations, generally it's hard to use them as generative models because there's no distribution we can sample from in the latent space. In other words, we have no way to choose a point $\bb{z}$ in the latent space such that $\Psi(\bb{z})$ will end up on the data manifold in the instance space.

The variational autoencoder (VAE), first proposed by Kingma and Welling, addresses this issue by taking a probabilistic perspective. Briefly, a VAE model can be described as follows.

We define, in Bayesian terminology,

To create our variational decoder we'll further specify:

This setting allows us to generate a new instance $\bb{x}$ by sampling $\bb{z}$ from the multivariate normal distribution, obtaining the instance-space mean $\Psi _{\bb{\beta}}(\bb{z})$ using our decoder network, and then sampling $\bb{x}$ from $\mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$.

Our variational encoder will approximate the posterior with a parametric distribution $q _{\bb{\alpha}}(\bb{Z} | \bb{x}) = \mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$. The interpretation is that our encoder model, $\Phi_{\vec{\alpha}}(\bb{x})$, calculates the mean and variance of the posterior distribution, and samples $\bb{z}$ based on them. An important nuance here is that our network can't contain any stochastic elements that depend on the model parameters, otherwise we won't be able to back-propagate to those parameters. So sampling $\bb{z}$ from $\mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$ is not an option. The solution is to use what's known as the reparametrization trick: sample from an isotropic Gaussian, i.e. $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ (which doesn't depend on trainable parameters), and calculate the latent representation as $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{u}\odot\bb{\sigma}_{\bb{\alpha}}(\bb{x})$.
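As a minimal sketch (assuming PyTorch; the function name and the choice to pass the log-variance are illustrative, not the required hw4 interface), the reparametrization trick is just a shift-and-scale of standard Gaussian noise:

```python
import torch

def reparametrize(mu, log_sigma2):
    # u ~ N(0, I): the only stochastic element, independent of the parameters.
    u = torch.randn_like(mu)
    sigma = torch.exp(0.5 * log_sigma2)  # sigma = sqrt(exp(log sigma^2))
    # z = mu + u * sigma is deterministic in (mu, sigma), so gradients flow.
    return mu + u * sigma

mu = torch.zeros(4, 8, requires_grad=True)
log_sigma2 = torch.zeros(4, 8, requires_grad=True)
z = reparametrize(mu, log_sigma2)
z.sum().backward()  # gradients reach both mu and log_sigma2
```

Note that sampling `u` with `randn_like` keeps the randomness outside the computation graph, which is exactly the point of the trick.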

To train a VAE model, we maximize the evidence distribution, $p(\bb{X})$ (see question below). The VAE loss can therefore be stated as minimizing $\mathcal{L} = -\mathbb{E}_{\bb{x}} \log p(\bb{X})$. Although this quantity is intractable, we can obtain a lower bound for $\log p(\bb{X})$ (the evidence lower bound, "ELBO", shown in the lecture):

$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right) $$

where $ \mathcal{D} _{\mathrm{KL}}(q\left\|\right.p) = \mathbb{E}_{\bb{z}\sim q}\left[ \log \frac{q(\bb{Z})}{p(\bb{Z})} \right] $ is the Kullback-Leibler divergence, which can be interpreted as the information gained by using the posterior $q(\bb{Z|X})$ instead of the prior distribution $p(\bb{Z})$.

Using the ELBO, the VAE loss becomes, $$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$

By remembering that the likelihood is a Gaussian distribution with a diagonal covariance and by applying the reparametrization trick, we can write the above as

$$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} } \left[ \frac{1}{2\sigma^2}\left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$

Model Implementation

Obviously our model will have two parts, an encoder and a decoder. Since we're working with images, we'll implement both as deep convolutional networks, where the decoder is a "mirror image" of the encoder implemented with adjoint (AKA transposed) convolutions. Between the encoder CNN and the decoder CNN we'll implement the sampling from the parametric posterior approximator $q_{\bb{\alpha}}(\bb{Z}|\bb{x})$ to make it a VAE model and not just a regular autoencoder (of course, this is not yet enough to create a VAE, since we also need a special loss function which we'll get to later).

First let's implement just the CNN part of the Encoder network (this is not the full $\Phi_{\vec{\alpha}}(\bb{x})$ yet). As usual, it should take an input image and map it to an activation volume of a specified depth. We'll consider this volume as the features we extract from the input image. Later we'll use these to create the latent space representation of the input.

TODO: Implement the EncoderCNN class in the hw4/autoencoder.py module. Implement any CNN architecture you like. If you need "architecture inspiration" you can see e.g. this or this paper.

Now let's implement the CNN part of the Decoder. Again, this is not yet the full $\Psi _{\bb{\beta}}(\bb{z})$. It should take an activation volume produced by your EncoderCNN and output an image of the same dimensions as the Encoder's input. This can be a CNN which is like a "mirror image" of the Encoder. For example, replace convolutions with transposed convolutions, downsampling with up-sampling, etc. Consult the documentation of ConvTranspose2d to figure out how to reverse your convolutional layers in terms of input and output dimensions. Note that the decoder doesn't have to be exactly the opposite of the encoder and you can experiment with using a different architecture.

TODO: Implement the DecoderCNN class in the hw4/autoencoder.py module.

Let's now implement the full VAE Encoder, $\Phi_{\vec{\alpha}}(\vec{x})$. It will work as follows:

  1. Produce a feature vector $\vec{h}$ from the input image $\vec{x}$.
  2. Use two affine transforms to convert the features into the mean and log-variance of the posterior, i.e. $$ \begin{align} \bb{\mu} _{\bb{\alpha}}(\bb{x}) &= \vec{h}\mattr{W}_{\mathrm{h\mu}} + \vec{b}_{\mathrm{h\mu}} \\ \log\left(\bb{\sigma}^2_{\bb{\alpha}}(\bb{x})\right) &= \vec{h}\mattr{W}_{\mathrm{h\sigma^2}} + \vec{b}_{\mathrm{h\sigma^2}} \end{align} $$
  3. Use the reparametrization trick to create the latent representation $\vec{z}$.

Notice that we model the log of the variance, not the actual variance. The above formulation is proposed in appendix C of the VAE paper.
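Sketched in PyTorch, the two affine heads plus the reparametrization step could look roughly like this (the class and attribute names are hypothetical, and the feature vector $\vec{h}$ is assumed to come from your EncoderCNN, flattened):

```python
import torch
import torch.nn as nn

class EncoderHead(nn.Module):
    """Maps flattened CNN features h to (z, mu, log_sigma2). Illustrative only."""
    def __init__(self, h_dim, z_dim):
        super().__init__()
        self.fc_mu = nn.Linear(h_dim, z_dim)          # W_h_mu, b_h_mu
        self.fc_log_sigma2 = nn.Linear(h_dim, z_dim)  # W_h_sigma2, b_h_sigma2

    def forward(self, h):
        mu = self.fc_mu(h)
        log_sigma2 = self.fc_log_sigma2(h)
        # Reparametrization trick: z = mu + u * sigma, with u ~ N(0, I).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_sigma2)
        return z, mu, log_sigma2

head = EncoderHead(h_dim=128, z_dim=2)
z, mu, log_sigma2 = head(torch.rand(5, 128))
```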

TODO: Implement the encode() method in the VAE class within the hw4/autoencoder.py module. You'll also need to define your parameters in __init__().

Let's sample some 2d latent representations for an input image x0 and visualize them.

Let's now implement the full VAE Decoder, $\Psi _{\bb{\beta}}(\bb{z})$. It will work as follows:

  1. Produce a feature vector $\tilde{\vec{h}}$ from the latent vector $\vec{z}$ using an affine transform.
  2. Reconstruct an image $\tilde{\vec{x}}$ from $\tilde{\vec{h}}$ using the decoder CNN.

TODO: Implement the decode() method in the VAE class within the hw4/autoencoder.py module. You'll also need to define your parameters in __init__(). You may need to also re-run the block above after you implement this.

Our model's forward() function will simply return decode(encode(x)) as well as the calculated mean and log-variance of the posterior.

Loss Implementation

In practice, since we're using SGD, we'll drop the expectation over $\bb{X}$ and instead sample an instance from the training set and compute a point-wise loss. Similarly, we'll drop the expectation over $\bb{Z}$ by sampling from $q_{\vec{\alpha}}(\bb{Z}|\bb{x})$. Additionally, because the KL divergence is between two Gaussian distributions, there is a closed-form expression for it. These points bring us to the following point-wise loss:

$$ \ell(\vec{\alpha},\vec{\beta};\bb{x}) = \frac{1}{\sigma^2 d_x} \left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 + \mathrm{tr}\,\bb{\Sigma} _{\bb{\alpha}}(\bb{x}) + \|\bb{\mu} _{\bb{\alpha}}(\bb{x})\|^2 _2 - d_z - \log\det \bb{\Sigma} _{\bb{\alpha}}(\bb{x}), $$

where $d_z$ is the dimension of the latent space, $d_x$ is the dimension of the input and $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$. This pointwise loss is the quantity that we'll compute and minimize with gradient descent. The first term corresponds to the data-reconstruction loss, while the second term corresponds to the KL-divergence loss. Note that the scaling by $d_x$ is not derived from the original loss formula and was added directly to the pointwise loss just to normalize the data term.
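For illustration, here is a hedged sketch of the point-wise loss above (the function name and signature are ours, not the required vae_loss() interface), with the diagonal $\bb{\Sigma}_{\bb{\alpha}}$ represented by its log-variance:

```python
import torch

def vae_pointwise_loss(x, x_rec, mu, log_sigma2, x_sigma2):
    # x: input batch; x_rec: decoder output Psi(mu + Sigma^(1/2) u);
    # mu, log_sigma2: posterior parameters; x_sigma2: the sigma^2 hyperparameter.
    dx = x[0].numel()  # instance-space dimension d_x
    dz = mu.shape[1]   # latent dimension d_z
    # Data term: ||x - x_rec||^2 scaled by 1 / (sigma^2 * d_x).
    data_term = ((x - x_rec) ** 2).flatten(1).sum(dim=1) / (x_sigma2 * dx)
    # KL term: tr(Sigma) + ||mu||^2 - d_z - log det(Sigma), with Sigma diagonal.
    kl_term = (log_sigma2.exp().sum(dim=1) + (mu ** 2).sum(dim=1)
               - dz - log_sigma2.sum(dim=1))
    return (data_term + kl_term).mean()

x = torch.rand(16, 3, 8, 8)
# Perfect reconstruction with mu = 0 and log(sigma^2) = 0 makes both terms vanish.
loss = vae_pointwise_loss(x, x.clone(), torch.zeros(16, 2), torch.zeros(16, 2), 0.5)
```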

TODO: Implement the vae_loss() function in the hw4/autoencoder.py module.

Sampling

The main advantage of a VAE is that it can be used as a generative model by sampling the latent space, since we optimize for an isotropic Gaussian prior $p(\bb{Z})$ in the loss function. Let's now implement this so that we can visualize how our model is doing when we train.

TODO: Implement the sample() method in the VAE class within the hw4/autoencoder.py module.

Training

Time to train!

TODO:

  1. Implement the VAETrainer class in the hw4/training.py module. Make sure to implement the checkpoints feature of the Trainer class if you haven't done so already in Part 1.
  2. Tweak the hyperparameters in the part2_vae_hyperparams() function within the hw4/answers.py module.

TODO:

  1. Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
  2. When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.

The images you get should be colorful, with different backgrounds and poses.

Questions

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

Question 1

What does the $\sigma^2$ hyperparameter (x_sigma2 in the code) do? Explain the effect of low and high values.

Question 2

  1. Explain the purpose of both parts of the VAE loss term - reconstruction loss and KL divergence loss.
  2. How is the latent-space distribution affected by the KL loss term?
  3. What's the benefit of this effect?

Question 3

In the formulation of the VAE loss, why do we start by maximizing the evidence distribution, $p(\bb{X})$?

Question 4

In the VAE encoder, why do we model the log of the latent-space variance corresponding to an input, $\bb{\sigma}^2_{\bb{\alpha}}$, instead of directly modelling this variance?

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$

Part 3: Generative Adversarial Networks

In this part we will implement and train a generative adversarial network and apply it to the task of image generation.

Obtaining the dataset

We'll use the same data as in Part 2.

But again, you can use a custom dataset, by editing the PART3_CUSTOM_DATA_URL variable in hw4/answers.py.

Create a Dataset object that will load the extracted images:

OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.

Generative Adversarial Nets (GANs)

GANs, first proposed in a paper by Ian Goodfellow in 2014, are today arguably the most popular type of generative model. GANs are currently producing state-of-the-art results in generative tasks over many different domains.

In a GAN model, two different neural networks compete against each other: A generator and a discriminator.

Training GANs

The generator is trained to generate "fake" instances which will maximally fool the discriminator into returning that they're real. Mathematically, the generator's parameters $\bb{\gamma}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

The discriminator is trained to classify between real images, coming from the training set, and fake images generated by the generator. Mathematically, the discriminator's parameters $\bb{\delta}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

These two competing objectives can thus be expressed as the following min-max optimization: $$ \min _{\bb{\gamma}} \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

A key insight into GANs is that we can interpret the above maximum as the loss with respect to $\bb{\gamma}$:

$$ L({\bb{\gamma}}) = \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

This means that the generator's loss function trains together with the generator itself in an adversarial manner. In contrast, when training our VAE we used a fixed L2 norm as a data loss term.

Model Implementation

We'll now implement a Deep Convolutional GAN (DCGAN) model. See the DCGAN paper for architecture ideas and tips for training.

TODO: Implement the Discriminator class in the hw4/gan.py module. If you wish you can reuse the EncoderCNN class from the VAE model as the first part of the Discriminator.

TODO: Implement the Generator class in the hw4/gan.py module. If you wish you can reuse the DecoderCNN class from the VAE model as the last part of the Generator.

Loss Implementation

Let's begin with the discriminator's loss function. Based on the above we can flip the sign and say we want to update the Discriminator's parameters $\bb{\delta}$ so that they minimize the expression $$ -\mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, - \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

We're using the Discriminator twice in this expression; once to classify data from the real data distribution and once again to classify generated data. Therefore our loss should be computed based on these two terms. Notice that since the discriminator returns a probability, we can formulate the above as two cross-entropy losses.

GANs are notoriously difficult to train. One common trick for improving GAN stability during training is to make the classification labels noisy for the discriminator. This can be seen as a form of regularization that helps prevent the discriminator from overfitting.

We'll incorporate this idea into our loss function. Instead of labels being equal to 0 or 1, we'll make them "fuzzy", i.e. random numbers in the ranges $[0\pm\epsilon]$ and $[1\pm\epsilon]$.
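One possible way to write this (a sketch only; names, logit-based inputs and the uniform-noise choice are our assumptions, not the required discriminator_loss_fn() interface) is as two cross-entropy terms with jittered labels:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(y_real, y_fake, label_noise=0.2):
    # y_real: Discriminator logits on real data; y_fake: logits on generated data.
    # "Fuzzy" labels: uniform noise of width label_noise around 1 (real) and 0 (fake).
    real_labels = 1.0 + (torch.rand_like(y_real) - 0.5) * label_noise
    fake_labels = 0.0 + (torch.rand_like(y_fake) - 0.5) * label_noise
    loss_real = F.binary_cross_entropy_with_logits(y_real, real_labels)
    loss_fake = F.binary_cross_entropy_with_logits(y_fake, fake_labels)
    return loss_real + loss_fake
```

A discriminator that is confidently correct (large positive logits on real data, large negative on fake) yields a much smaller loss than one with the predictions swapped.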

TODO: Implement the discriminator_loss_fn() function in the hw4/gan.py module.

Similarly, the generator's parameters $\bb{\gamma}$ should minimize the expression $$ -\mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )) $$

which can also be seen as a cross-entropy term. This corresponds to "fooling" the discriminator; Notice that the gradient of the loss w.r.t $\bb{\gamma}$ using this expression also depends on $\bb{\delta}$.
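A minimal sketch of this term (again with assumed names and logit inputs, not the required generator_loss_fn() interface): a cross-entropy of the Discriminator's outputs on generated samples against the "real" label:

```python
import torch
import torch.nn.functional as F

def generator_loss(y_fake):
    # y_fake: Discriminator logits on generated samples. These must remain part
    # of the Generator's computation graph so gradients reach gamma through delta.
    real_label = torch.ones_like(y_fake)
    return F.binary_cross_entropy_with_logits(y_fake, real_label)

# The loss is small when the Discriminator is fooled (confident "real" logits)...
fooled = generator_loss(torch.full((8, 1), 5.0))
# ...and large when it isn't.
not_fooled = generator_loss(torch.full((8, 1), -5.0))
```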

TODO: Implement the generator_loss_fn() function in the hw4/gan.py module.

Sampling

Sampling from a GAN is straightforward, since it learns to generate data from an isotropic Gaussian latent space distribution.

There is an important nuance, however. Sampling is required during the process of training the GAN, since we generate fake images to show the discriminator. As you'll see in the next section, in some cases we'll need our samples to have gradients (i.e., to be part of the Generator's computation graph).

TODO: Implement the sample() method in the Generator class within the hw4/gan.py module.

Training

Training GANs is a bit different since we need to train two models simultaneously, each with its own separate loss function and optimizer. We'll implement the training logic as a function that handles one batch of data and updates both the discriminator and the generator based on it.

As mentioned above, GANs are considered hard to train. To get some ideas and tips you can see this paper, this list of "GAN hacks" or just do it the hard way :)

TODO:

  1. Implement the train_batch function in the hw4/gan.py module.
  2. Tweak the hyperparameters in the part3_gan_hyperparams() function within the hw4/answers.py module.

TODO:

  1. Implement the save_checkpoint function in the hw4.gan module. You can decide on your own criterion regarding whether to save a checkpoint at the end of each epoch.
  2. Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
  3. When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.

Questions

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

Question 1

Explain in detail why during training we sometimes need to maintain gradients when sampling from the GAN, and other times we don't. When are they maintained and why? When are they discarded and why?

Question 2

  1. When training a GAN to generate images, should we decide to stop training solely based on the fact that the Generator loss is below some threshold? Why or why not?

  2. What does it mean if the discriminator loss remains at a constant value while the generator loss decreases?

Question 3

Compare the results you got when generating images with the VAE to the GAN results. What's the main difference and what's causing it?

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\cset}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\ip}[3]{\left<#1,#2\right>_{#3}} \newcommand{\given}[]{\,\middle\vert\,} \newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)} \newcommand{\grad}[]{\nabla} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} $$

Part 4: Summary Questions

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

Notes

CNNs

  1. Explain the meaning of the term "receptive field" in the context of CNNs.

Answer:

In a neural network context, the receptive field of a feature is the size of the region in the input that produces that feature (Wikipedia).
In other words, it is the portion of the input needed to create a specific feature we are looking at, at any convolutional layer.
The receptive fields of different features partially overlap and, together, cover the entire input space.
When stacking convolutional layers, receptive fields grow: each feature in a deeper layer takes input from a larger area of the previous layer's output, and hence from a larger area of the original image.
As an intuition, it can be compared to our visual system: early features "see" only a small portion of the input, and the effective region grows as successive convolutions combine them to make sense of the whole scene.
The receptive field size is affected by the kernel size, padding and stride.


  2. Explain and elaborate on three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

Answer:

The rate at which the receptive field grows from layer to layer can be controlled by the following:

  1. Pooling - reduces the dimension of the feature map by combining (e.g. averaging, or taking the maximum of) features in the same region. Subsequent convolutional layers therefore see increasingly larger parts of the input image, which results in a rapid increase in receptive field size.

  2. Stride - how far the filter moves at each step along a direction; it determines how much the receptive fields of neighboring features overlap. Larger strides cause a smaller overlap of input pixels between features, and thus a faster growth of the receptive field between layers. Unlike pooling, stride doesn't aggregate features; it simply skips input positions.

  3. Dilation - places the kernel weights at fixed intervals (i.e., makes the kernel sparser), so the effective kernel size grows without adding parameters, and the kernel combines input features that are spatially far apart. By monotonically increasing the dilation factor through the layers, the receptive field can be expanded rapidly without loss of resolution.


  3. Imagine a CNN with three convolutional layers, defined as follows:

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

Answer:

The receptive field at each layer is derived from the layer before it, so we can compute the network's receptive field recursively.
The recursive formula for the receptive field size of the output tensor is: $$ r_k = r_{k-1} + (g_k-1)\cdot \prod_{i=1}^{k-1}s_i$$ where $r_k$ is the receptive field at layer $k$ (with $r_0 = 1$),
$g_k$ is the kernel size of layer $k$, and
$s_i$ is the stride of layer $i$.
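The recursion is easy to evaluate in code; since the question's layer definitions are not reproduced here, the example below uses hypothetical layers (three 3x3 convolutions with strides 1, 2, 2):

```python
def receptive_field(kernels, strides):
    # r_k = r_{k-1} + (g_k - 1) * prod_{i<k} s_i, starting from r_0 = 1.
    r, jump = 1, 1
    for g, s in zip(kernels, strides):
        r += (g - 1) * jump
        jump *= s  # cumulative product of strides so far
    return r

# Hypothetical example: 3x3 kernels with strides 1, 2, 2.
rf = receptive_field(kernels=[3, 3, 3], strides=[1, 2, 2])  # 1 + 2 + 2 + 4 = 9
```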


  4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

    After hearing that residual networks can be made much deeper, you decide to change each layer in your network you used the following residual mapping instead $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

    However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

Answer:

The original and residual networks produce completely different filters because the filters of a residual layer learn the difference (the "residual") between the layer's input and output, rather than the output itself. This can be seen by rearranging the given mapping: $$f_l(\vec{x};\vec{\theta}_l)=\vec{y}_l-\vec{x}$$


Dropout

  1. Consider the following neural network:

If we want to replace the two consecutive dropout layers with a single one defined as follows:

nn.Dropout(p=q)

what would the value of q need to be? Write an expression for q in terms of p1 and p2.

Answer:

A unit is kept only if it survives both dropout layers, which happens with probability $(1-p_1)(1-p_2)$. Hence the equivalent single dropout layer must drop units with probability $$q = 1-(1-p_1)(1-p_2) = p_1 + p_2 - p_1 p_2.$$ Simplified in words: the combined layer drops a unit whenever either of the original layers would have dropped it.
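A unit passes both layers only if each one keeps it, so the combined drop probability should be $q = 1-(1-p_1)(1-p_2)$; a quick Monte Carlo simulation (with arbitrarily chosen $p_1$, $p_2$) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2 = 0.3, 0.5
n = 1_000_000
# Each unit survives layer 1 w.p. (1 - p1) and layer 2 w.p. (1 - p2), independently.
kept = (rng.random(n) >= p1) & (rng.random(n) >= p2)
q_empirical = 1 - kept.mean()
q_formula = 1 - (1 - p1) * (1 - p2)  # = 0.65 for these values
assert abs(q_empirical - q_formula) < 1e-2
```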


  2. True or false: dropout must be placed only after the activation function.

Answer:

False.
Dropout is usually applied after the activation function, but it can also be placed before it. In particular when using ReLU, performing dropout before the activation gives an equivalent result (ReLU maps zero to zero and commutes with the positive dropout scaling), and can even be more computationally efficient.


  3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

Answer:

Let $x$ denote an activation with expectation $\mathbb{E}[x]$, let $p$ be the drop probability, and let $m \sim \mathrm{Bernoulli}(1-p)$ be the dropout mask, drawn independently of $x$. The activation after dropout is $\hat{x} = m\cdot x$. By the independence of $m$ and $x$ and linearity of expectation: $$\mathbb{E}[\hat{x}] = \mathbb{E}[m\cdot x] = \mathbb{E}[m]\cdot\mathbb{E}[x] = (1-p)\cdot \mathbb{E}[x]$$ Hence $\mathbb{E}\left[\frac{\hat{x}}{1-p}\right] = \mathbb{E}[x]$: scaling the surviving activations by $1/(1-p)$ is exactly what is required to keep each activation unchanged in expectation.
Q.E.D
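The scaling argument can also be verified numerically with inverted dropout (the mean and drop probability below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.4
x = rng.normal(loc=2.0, size=1_000_000)  # activations with E[x] = 2
mask = rng.random(x.shape) >= p          # keep each unit w.p. 1 - p
x_dropped = x * mask / (1 - p)           # dropout + 1/(1-p) scaling
# The 1/(1-p) factor cancels the (1-p) kept-fraction, so the mean is preserved.
assert abs(x_dropped.mean() - x.mean()) < 1e-2
```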


Losses and Activation functions

  1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

Answer:

L2 loss fits regression tasks, whereas here we have a classification task. Binary cross-entropy (BCE) is the better choice, since it penalizes the model heavily in cases of uncertainty and thus forces it to keep learning until it predicts confidently and correctly.
Let's demonstrate how BCE penalizes an uncertain prediction more than L2:
suppose the model assigns probability $0.45$ to the true class — quite an uncertain score, almost 50-50. Let's see which loss penalizes it more: $$L_2(0.45)= (1-0.45)^2 = 0.3025 \\ L_{BCE}(0.45)= -\log(0.45) \approx 0.7985 $$ We see that $L_{BCE} > L_2$: BCE penalizes the uncertain prediction more strongly, and the larger loss (and gradient) pushes the model to keep training.
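The two loss values in this example can be reproduced directly:

```python
import math

p = 0.45           # the model's probability for the true class
l2 = (1 - p) ** 2  # L2 loss: 0.3025
bce = -math.log(p) # BCE loss: ~0.7985
assert bce > l2    # BCE penalizes the uncertain prediction more
```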


  2. After months of research into the origins of climate change, you observe the following result:

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in N locations around the globe. You define your model as follows:

While training your model you notice that the loss reaches a plateau after only a few iterations. It seems that your model is no longer training. What is the most likely cause?

Answer:

The most likely cause is vanishing gradients. The chosen architecture seems deeper than necessary, and it uses neither batch normalization nor skip connections.
Skip connections in deep architectures, as the name suggests, skip some layers in the network and feed the output of one layer directly as input to later layers.
When none of the above is applied, the gradient becomes very small as it back-propagates toward the earlier layers of a deep architecture. In some cases the gradient becomes effectively zero, meaning the early layers are not updated at all; the loss plateaus and the model stops training.


  3. Referring to question 2 above: A friend suggests that if you replace the sigmoid activations with tanh, it will solve your problem. Is he correct? Explain why or why not.

Answer:

No. The gradient of tanh, like that of the sigmoid, approaches zero quickly as its input moves away from zero; its derivative, $\mathrm{sech}^2$, is bounded on $(0,1]$ (the sigmoid's derivative is bounded on $(0,0.25]$), so the activations still saturate and the change won't make a significant difference.
ReLU, on the other hand, remains linear (with gradient 1) over the positive range where sigmoid and tanh saturate toward 1, and therefore deals much better with vanishing gradients.


  4. Regarding the ReLU activation, state whether the following sentences are true or false and explain:
    4.1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
    4.2. The gradient of ReLU is linear with its input when the input is positive.
    4.3. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

Answer:

4.1. False - activation functions are not the only possible cause of vanishing gradients; they can also result, for example, from a network that is too deep and has no skip connections.

4.2. False - the gradient of ReLU is constant (equal to 1) whenever the input is positive, not linear in the input.

4.3. True - a negative input to ReLU produces an output of 0 and a gradient of 0, so the weights feeding that neuron are never updated and the neuron stays "dead". This is exactly the case LeakyReLU was designed to handle.


Optimization

  1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

Answer:

The difference between these optimizers lies in the number of samples used for each parameter update:
GD uses all of the training samples for each update, SGD uses one random sample per update, and mini-batch SGD uses a small fixed-size 'batch' of samples for each update.
For very large datasets GD is too expensive, or even infeasible, in terms of computation time and memory; SGD is more practical and converges to a local minimum more quickly. Mini-batch SGD improves on single-sample SGD, whose updates are rather too noisy.


  2. Regarding SGD and GD:

    2.1. Provide at least two reasons for why SGD is used more often in practice compared to GD.

    2.2. In what cases can GD not be used at all?

Answer:

2.1.
i. Slow training: in GD, each gradient update can take a long time, because it processes all training samples in every iteration.
ii. SGD may generalize better, since the randomly selected samples add noise to each update, while GD trains on the full data every step and can overfit.

2.2.
When the dataset is too large, and/or the machine has too little memory: computing the gradient over the whole dataset at once would train extremely slowly in the best case, or exhaust memory entirely in the worst case.


  1. You have trained a deep resnet to obtain SoTA results on ImageNet. While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average. Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM. You're now considering to increase the mini-batch size from $B$ to $2B$. Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

Answer:

The number of iterations should decrease: with more samples per update, the gradient estimate is more accurate and less noisy, so each step reduces the loss more reliably and fewer updates are needed to reach $l_0$.
Note, however, that fewer iterations does not necessarily mean faster training in wall-clock time: each iteration now processes twice as many samples, so total training time may even increase.


  1. For each of the following statements, state whether they're true or false and explain why.
    4.1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
    4.2. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
    4.3. SGD is less likely to get stuck in local minima, compared to GD.
    4.4. Training with SGD requires more memory than with GD.
    4.5. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
    4.6. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

Answer:

4.1. True - In every epoch, each sample in the dataset participates in an optimization step, either on its own (plain SGD) or as part of a batch (mini-batch SGD).

4.2. False - SGD uses only one sample per update, so its gradient estimates have *higher* variance than GD's, not lower. The cheap, noisy steps can still yield faster overall convergence in practice, but the trajectory is less stable.

4.3. True - Because each update is computed from freshly sampled data, the gradient noise can kick the iterate out of shallow local minima, whereas classical GD follows the exact gradient and tends to settle stably into the nearest local minimum.

4.4. False - As noted above, GD consumes more memory, since each step processes the entire dataset rather than a single sample (or small batch) as in SGD.

4.5. False - Neither guarantee holds for general non-convex losses. GD's full-data gradient gives a stable, consistent descent direction, so (with an appropriate learning rate) it tends to settle into a local minimum, which need not be the *global* minimum. SGD's gradient is less stable and its direction varies with the 'chunk' of samples drawn in each iteration, giving it more chances to escape local minima if it stumbles into them, but also no guarantee of converging to one.

4.6. True - Newton's method uses second-order derivatives, which are computationally much more expensive per step than first-order SGD with momentum. Moreover, Newton's method can be attracted to saddle points, which are common in such high-curvature regions, whereas momentum damps the oscillations across the narrow ravine that vanilla SGD suffers from and makes steady progress along it.


  1. Bonus (we didn't discuss this in class): We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network. True or false: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc). Provide a mathematical justification for your answer.
  1. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$. Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
    6.1. Explain the concepts of "vanishing gradients", and "exploding gradients".
    6.2. How can each of these problems be caused by increased depth?
    6.3. Provide a numerical example demonstrating each.
    6.4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

Answer:

6.1.
Vanishing gradients - the gradients shrink as they are propagated backwards through the network, until they become so small that they are effectively 0.
Exploding gradients - the gradients grow as they are propagated backwards through the network, until they become so large that the update steps overshoot and the optimizer cannot find a minimum.
Both phenomena stem from the chain rule: repeated multiplication drastically amplifies large gradients and drastically shrinks small ones.
6.2.
As described above, during backpropagation the gradient at each layer is multiplied by the gradient of the next layer, and so on. Increasing the depth means more multiplications, so factors larger than 1 compound towards infinity while factors smaller than 1 decay towards 0. The deeper the network, the faster gradients either explode or vanish.
6.3.
Assume a fairly deep network with $n$ layers in which each layer simply multiplies its input by a weight $w$ (identity activation), and assume all weights throughout the network are equal, so the gradient w.r.t. the input is $w^n$.
With $w = 0.5$ and $n = 10$ layers, the gradient is scaled by $0.5^{10} \approx 0.001$, which tends to 0 as more layers are added - vanishing gradients. Switching to a large weight, say $w = 5$, after 10 layers the gradient is scaled by $5^{10} = 9765625$, which keeps growing as the network gets deeper - exploding gradients.
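The example can be reproduced in a few lines (a sketch: each "layer" just multiplies by a fixed weight, so the gradient w.r.t. the input is exactly $w^n$):

```python
import torch

def chain_grad(w, depth=10):
    """Backprop through `depth` layers that each multiply their input by `w`."""
    x = torch.ones(1, requires_grad=True)
    h = x
    for _ in range(depth):
        h = w * h
    h.backward()
    return x.grad.item()

print(chain_grad(0.5))   # 0.5**10 ~ 0.00098  -> vanishing
print(chain_grad(5.0))   # 5**10  = 9765625.0 -> exploding
```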

6.4.
We can tell from the loss values and the loss curve during training: with exploding gradients the loss oscillates wildly, diverges, or becomes NaN/inf, while with vanishing gradients the loss barely changes between iterations and plateaus very early in training.


Backpropagation

  1. You wish to train the following 2-layer MLP for a binary classification task: $$ \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2 $$ You wish to minimize the in-sample loss function, defined as $$ L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right) $$ Where the pointwise loss is binary cross-entropy: $$ \ell(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $$

    Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

Answer:

We first denote $Z = \mat{W}_1 \vec{x}+ \vec{b}_1$ and, for brevity, $\delta_i = -\frac{y_i}{\hat{y}_i} + \frac{1-y_i}{1-\hat{y}_i}$, which is $\frac{d\ell}{d\hat{y}}$ for sample $i$. Note also that the derivative of the regularizer $\frac{\lambda}{2}\norm{\mat{W}}_F^2$ w.r.t. $\mat{W}$ is $\lambda\mat{W}$ (not $\lambda\norm{\mat{W}}_F$). Deriving by $x$: $$ \frac{dL_s}{dx} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{dx} = \frac{1}{N}\sum_{i=1}^{N}\delta_i\cdot W_2\cdot\frac{d\varphi}{dZ}\cdot W_1 $$ Deriving by $b_1$: $$ \frac{dL_s}{db_1} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{db_1} = \frac{1}{N}\sum_{i=1}^{N}\delta_i\cdot W_2\cdot\frac{d\varphi}{dZ} $$ Deriving by $b_2$: $$ \frac{dL_s}{db_2} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{db_2} = \frac{1}{N}\sum_{i=1}^{N}\delta_i $$ Deriving by $W_1$: $$ \frac{dL_s}{dW_1} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{dW_1} + \lambda\mat{W}_1 = \frac{1}{N}\sum_{i=1}^{N}\delta_i\cdot W_2\cdot\frac{d\varphi}{dZ}\cdot x^\top + \lambda\mat{W}_1 $$ Deriving by $W_2$: $$ \frac{dL_s}{dW_2} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{dW_2} + \lambda\mat{W}_2 = \frac{1}{N}\sum_{i=1}^{N}\delta_i\cdot \varphi(Z)^\top + \lambda\mat{W}_2 $$
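These derivatives can be sanity-checked against autograd on a tiny instance (a sketch: $\varphi = \tanh$ is assumed, and the weights are kept small so that $\hat{y}\in(0,1)$ and the BCE is well-defined; all names and sizes are illustrative):

```python
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)

N, d, h = 5, 3, 4
X = torch.randn(N, d) * 0.1
y = torch.randint(0, 2, (N, 1)).double()
W1 = torch.randn(h, d, requires_grad=True)
b1 = torch.zeros(h, requires_grad=True)
W2 = (torch.randn(1, h) * 0.02).requires_grad_(True)
b2 = torch.tensor([0.5], requires_grad=True)
lam = 0.01

phi = torch.tanh
Z = X @ W1.T + b1                      # (N, h)
y_hat = phi(Z) @ W2.T + b2             # stays inside (0, 1) thanks to small weights
loss = (-(y * y_hat.log() + (1 - y) * (1 - y_hat).log())).mean() \
       + lam / 2 * (W1.norm() ** 2 + W2.norm() ** 2)
loss.backward()

# delta_i = dl/dy_hat for sample i
delta = (-y / y_hat + (1 - y) / (1 - y_hat)).detach()

# dL/db2 = (1/N) sum_i delta_i
assert torch.allclose(b2.grad, delta.mean(0))
# dL/dW2 = (1/N) sum_i delta_i * phi(Z_i)^T + lambda * W2
gW2 = delta.T @ phi(Z).detach() / N + lam * W2.detach()
assert torch.allclose(W2.grad, gW2)
```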


  1. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is $$ f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}} $$

    1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).

    2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

Answer:

First, how the formula is used: for each scalar parameter $\theta_j$ of the network, perturb it by a small $\epsilon$ while holding all other parameters fixed, and approximate $\pderiv{L}{\theta_j} \approx \frac{L(\theta_j+\epsilon)-L(\theta_j)}{\epsilon}$ (or, more accurately, the central difference $\frac{L(\theta_j+\epsilon)-L(\theta_j-\epsilon)}{2\epsilon}$). Repeating this for every parameter yields the full gradient using forward passes only, without AD.

The resulting drawbacks:

  1. The result depends on the choice of $\epsilon$: too big gives a poor approximation, too small causes floating-point truncation and cancellation. Either way the derivative can be inaccurate.
  2. Stability and computability: the operation can be numerically unstable (subtracting nearly equal numbers and dividing by a very small one), and it requires at least one full forward pass per parameter, so for a large network with many parameters it is vastly more expensive than a single backward pass with AD.

  1. Given the following code snippet:
    1. Write a short snippet that calculates the gradient of the loss w.r.t. W and b using the numerical-gradient approach from the previous question.
    2. Calculate the same derivatives with autograd.
    3. Show, by calling torch.allclose() that your numerical gradient is close to autograd's gradient.
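Since the original snippet is not shown here, the following is a sketch under the assumption of a simple linear model with MSE loss (`X`, `y`, `W`, `b` are illustrative stand-ins):

```python
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)

# Hypothetical stand-in for the missing snippet: a linear model with MSE loss.
X = torch.randn(8, 3)
y = torch.randn(8, 1)
W = torch.randn(3, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

def loss_fn():
    return ((X @ W + b - y) ** 2).mean()

# (1) Numerical gradient: central difference, one parameter at a time.
eps = 1e-6
def numeric_grad(param):
    g = torch.zeros_like(param)
    flat = param.detach().view(-1)      # shares storage with param
    for i in range(flat.numel()):
        orig = flat[i].item()
        flat[i] = orig + eps
        hi = loss_fn().item()
        flat[i] = orig - eps
        lo = loss_fn().item()
        flat[i] = orig                  # restore original value
        g.view(-1)[i] = (hi - lo) / (2 * eps)
    return g

gW_num, gb_num = numeric_grad(W), numeric_grad(b)

# (2) The same derivatives with autograd.
loss_fn().backward()

# (3) Numerical and autograd gradients agree.
assert torch.allclose(W.grad, gW_num, atol=1e-6)
assert torch.allclose(b.grad, gb_num, atol=1e-6)
```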

Sequence models

  1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    2. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

Answer:


  1. Considering the following snippet, explain:
    1. What does Y contain? why this output shape?
    2. Bonus: How you would implement nn.Embedding yourself using only torch tensors.

Answer:
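The original snippet is not shown, but for the bonus part, a minimal `nn.Embedding` lookalike can be sketched with plain tensor indexing (a sketch; `MyEmbedding` and the sizes are illustrative):

```python
import torch

class MyEmbedding(torch.nn.Module):
    """A minimal nn.Embedding lookalike: a learnable table indexed by token id."""
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_embeddings, embedding_dim))

    def forward(self, idx):
        # Plain tensor indexing does the lookup; gradients flow only to the rows used.
        return self.weight[idx]

emb = MyEmbedding(10, 4)
tokens = torch.tensor([[1, 2, 3], [4, 5, 6]])   # (batch=2, seq_len=3)
out = emb(tokens)                               # (2, 3, 4)
```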


  1. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are true or false, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
    3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

Answer:

3.1. False. The backpropagation algorithm itself remains the same; TBPTT only limits how many timesteps the gradient is propagated back through. This technique is useful for dealing with vanishing gradients and for bounding memory and compute.

3.2. False. Limiting the input sequence length alone just trains the model on short, independent sequences. To implement TBPTT, we process the long sequence in chunks of length S, carrying the hidden state forward between consecutive chunks but detaching it, so that the state flows forward while gradients are truncated at chunk boundaries.

3.3. True. Since gradients are truncated after S timesteps, the model only receives a direct learning signal for relations between inputs that are at most S timesteps apart; longer-range relations are not directly learned.
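The detach-and-carry pattern described above can be sketched as follows (a sketch; the model, sizes, and targets are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S = 4                                     # truncation length
rnn = nn.RNN(input_size=2, hidden_size=8, batch_first=True)
seq = torch.randn(1, 20, 2)               # one long sequence of 20 timesteps
target = torch.randn(1, 20, 8)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)

h = None
for t in range(0, seq.size(1), S):
    chunk = seq[:, t:t + S]
    out, h = rnn(chunk, h)                # hidden state carried forward
    loss = ((out - target[:, t:t + S]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                        # truncate: no gradient beyond this chunk
```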


Attention

  1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.

    1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?

    2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial..). What influence do you expect this will have on the learned hidden states?

Answer:


Unsupervised learning

  1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term. What would be the qualitative effect of this on:

    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    2. Images generated by the model ($z \to x'$)?

Answer:

The KL-divergence term plays the role of a regularizer: it pushes the per-sample posteriors $q(z|x)$ towards the prior $\mathcal{N}(\vec{0},\vec{I})$. Not including it will: (1) likely improve, or at least not hurt, the reconstructions $x\to z\to x'$ during training, since the model is free to spread the latent codes apart and fit the training images without constraint; (2) substantially degrade generated images $z\to x'$, since the latent codes are no longer matched to the prior, so vectors sampled from $\mathcal{N}(\vec{0},\vec{I})$ fall in regions of the latent space the decoder never learned to decode.

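For reference, the full VAE objective with its closed-form KL term can be sketched as follows (a sketch; forgetting the KL term corresponds to `beta = 0`, and the names are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Reconstruction + beta * KL(q(z|x) || N(0, I)); beta=0 drops the KL term."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

x = torch.randn(2, 5)
mu, logvar = torch.zeros(2, 3), torch.zeros(2, 3)
# Perfect reconstruction and q(z|x) = N(0, I) give zero loss:
print(vae_loss(x, x, mu, logvar))
```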

  1. Regarding VAEs, state whether each of the following statements is true or false, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

Answer:


  1. Regarding GANs, state whether each of the following statements is true or false, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
    5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

Answer:


Detection and Segmentation

  1. What is the difference between IoU and the Dice score? What's the difference between IoU and mAP? Shortly explain when you would use each evaluation.

Answer:

Intersection-over-Union (IoU) is the area of overlap between the predicted segmentation and the ground truth, divided by the area of their union. It gives a numerical score for how close a predicted segment is to the ground truth, from 0 (no match) to 1 (perfect prediction).
Dice, on the other hand, is twice the area of overlap between the predicted segmentation and the ground truth, divided by the combined area of the prediction and the ground truth.
Dice can be used in similar circumstances to IoU, and the two are often used together.
There is a subtle difference between them, though: the Dice score tends to reflect average performance, whereas IoU is more sensitive to worst-case performance. In general, we can use IoU to determine for each detection whether it is a TP, FP, or FN; from this we build the precision-recall curve and use mAP to summarize it into a single value, the average of the precisions across all segments.
Nowadays, using mAP usually makes more sense, as it is a better representation of overall model quality than using F1 (Dice) to understand the balance between precision and recall per segment.
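Both scores can be computed from binary masks in a few lines (a sketch; note the identity Dice = 2·IoU/(1+IoU)):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def dice(pred, gt):
    """Twice the intersection over the sum of the two mask areas."""
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
# intersection = 2, union = 4, areas = 3 + 3
print(iou(pred, gt))    # 0.5
print(dice(pred, gt))   # 2*2/6 = 0.666...
```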


  1. Regarding YOLO and Mask R-CNN: which one is a one-stage detector? Describe the RPN outputs and the YOLO output; address how each network produces its output and the shapes of each output.

Answer:

YOLO - a one-stage detector: a single pass through the network predicts all bounding boxes and class scores.
Mask R-CNN - a two-stage detector: it first uses an RPN to generate regions of interest, which are then classified, refined, and masked in the second stage.
RPN outputs: a Region Proposal Network (RPN) is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. It is trained end-to-end to generate high-quality region proposals: it takes an image as input and outputs a set of bounding-box proposals (4 coordinates each) with a respective objectness score per proposal.
YOLO outputs: YOLO (v1) consists of 24 convolutional layers followed by 2 fully connected layers and a final detection layer. It divides the input image into an $S\times S$ grid; each cell predicts $B$ bounding boxes (4 coordinates plus a confidence each) together with $C$ class probabilities, so the output tensor has shape $S\times S\times(5B+C)$ (e.g. $7\times 7\times 30$ for $B=2$, $C=20$).


$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\cset}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\ip}[3]{\left<#1,#2\right>_{#3}} \newcommand{\given}[]{\,\middle\vert\,} \newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)} \newcommand{\grad}[]{\nabla} $$

Part 4: Mini-Project

In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.

Guidelines

Spectrally-Normalized Wasserstein GANs

One of the prevailing approaches for improving training stability for GANs is to use a technique called Spectral Normalization to normalize the largest singular value of a weight matrix so that it equals 1. This approach is generally applied to the discriminator's weights in order to stabilize training. The resulting model is sometimes referred to as a SN-GAN. See Appendix A in the linked paper for the exact algorithm. You can also use pytorch's spectral_norm.
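Applying it in PyTorch is a one-line wrap per layer (a sketch; the architecture shown is illustrative, not the HW3 discriminator):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization wraps a layer so that the largest singular value of
# its weight is normalized to (approximately) 1 via power iteration.
disc = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 1, kernel_size=4)),
)

x = torch.randn(2, 3, 8, 8)
out = disc(x)        # (2, 1, 1, 1): one score per image
```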

Another very common improvement to the vanilla GAN is known as the Wasserstein GAN (WGAN). It uses a simple modification to the loss function, with strong theoretical justifications based on the Wasserstein (earth-mover's) distance. See the tutorial or here for a brief explanation of this loss function.
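The loss modification can be sketched as follows (a sketch, assuming the critic outputs raw, unbounded scores; the clipping constant `c` follows the original WGAN recipe):

```python
import torch

# WGAN losses: the critic maximizes D(x) - D(G(z)), the generator maximizes
# D(G(z)); there are no sigmoid or log terms as in the vanilla GAN loss.
def critic_loss(d_real, d_fake):
    return -(d_real.mean() - d_fake.mean())

def generator_loss(d_fake):
    return -d_fake.mean()

# In the original WGAN, after each critic update the weights are clipped:
#   for p in critic.parameters(): p.data.clamp_(-c, c)
d_real = torch.tensor([0.8, 0.6])
d_fake = torch.tensor([0.1, 0.3])
print(critic_loss(d_real, d_fake))   # -(0.7 - 0.2) = -0.5
```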

One problem with generative models for images is that it's difficult to objectively assess the quality of the resulting images. To also obtain a quantitative score for the images generated by each model, we'll use the Inception Score. This runs a pre-trained Inception CNN on the generated images and computes a score based on the predicted probability for each class. Although not a perfect proxy for subjective quality, it's commonly used as a way to compare generative models. You can use an implementation of this score that you find online, e.g. this one, or implement it yourself.

You would gain a bonus if you also address Gradient Penalty; as we saw in the tutorial, it can improve the robustness of the GAN and substantially improve the results.

Based on the linked papers, add Spectral Normalization and the Wasserstein loss to your GAN from HW3. Compare between:

As a dataset, you can use LFW as in HW3 or CelebA, or even choose a custom dataset (note that there's a dataloader for CelebA in torchvision).

Your results should include:

Implementation

A short explanation of our implementation and how to reproduce the code:

TODO: This is where you should write your explanations and implement the code to display the results. See guidelines about what to include in this section.

We chose to keep using the same Bush dataset as the VAE & GAN notebooks, so that we can conduct an 'eye test' comparing the results of the project against the VAE and GAN notebooks:

First, let's present the vanilla GAN with all of its components:

The following samples from the vanilla GAN were reproduced with the GANTrainer.ipynb notebook. The load_SN_GAN() function loads a checkpoint that was saved in Colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'data_label': 1, 'label_noise': 0.1, 'discriminator_optimizer':
{'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.3, 0.999)}}

Now, for the SN-GAN we simply apply torch.nn.utils.spectral_norm to each of our discriminator's modules and use the same hyperparameters as the vanilla GAN, to isolate the effect of the normalization.

We trained the Python file 'SNgan.py' in the Part2_GAN.ipynb notebook to get the following results:

The following samples from the SN-GAN were reproduced with the GANTrainer.ipynb notebook. The load_SN_GAN() function loads a checkpoint that was saved in Colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'data_label': 1, 'label_noise': 0.1, 'discriminator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.3, 0.999)}}

To conclude, the visual comparison between samples from the vanilla GAN and the vanilla GAN with SN is the following:

The following was produced by running compare_imgs(device), which loads the same checkpoints used above.

Now, let's compare both models (GAN and SN-GAN) with the Inception Score.

We'll present the learning process: for each epoch we show the resulting Inception Score, calculated from 1000 samples drawn from each model's generator.
We present two kinds of graphs: 1) an as-is plot, showing the raw Inception Score over the course of training; 2) a trend plot, showing a polynomial-spline trend of the score, to assess how the Inception Score evolves during training and where it peaks.

We can see that the spectrally normalized GAN yielded a slightly better Inception Score trend and a higher maximal score.



WGAN (using Wasserstein Loss)

Now we'll present our WGAN findings.
Our WGAN is based on the same CNN architecture as the vanilla GAN, which is printed above, so we'll skip reprinting it. We will focus on tuning the best n_critic parameter using the Inception Score metric; after that, we'll present a sample of the images produced by the best chosen model. Note:

We see from the results that even though the WGAN with n_critic = 1 yielded the maximal Inception Score, the WGAN with n_critic = 5 had the highest Inception Score trend; in general this means that during the learning process its generator produces better samples, by the Inception metric at least.
We also noted that n_critic = 2 showed good IS results until the 50th epoch and then became less stable, producing worse IS scores than n_critic = 5, so we chose n_critic = 5.

Wgan samples with $n_{critic} = 5$ :

The following samples from WGAN were reproduced with GANTrainer.ipynb notebook. load_WGAN() function loads a checkpoint that was saved in colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'discriminator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'generator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'n_critic': 5, 'c': 0.01}

Now, We'll present the same process for WGAN with spectral norm

Here we are actually uncertain which n_critic value would produce the highest score, because n_critic = 20 showed a high peak in its trend while n_critic = 5 showed a stable, monotonically increasing trend.
So, we will inspect both!

The following samples from WGAN_SN were reproduced with GANTrainer.ipynb notebook. load_WGAN() function loads a checkpoint that was saved in colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'discriminator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'generator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'n_critic': 5, 'c': 0.01}

The following samples from WGAN_SN were reproduced with GANTrainer.ipynb notebook. load_WGAN() function loads a checkpoint that was saved in colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'discriminator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'generator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'n_critic': 20, 'c': 0.01}

It is actually quite difficult to decide which $n_{critic}$ is better, so we'll stick to $n_{critic} = 5$ for consistency when comparing WGAN to WGAN_SN.

Let's present more samples of WGAN with SN and $n_{critic} = 5$, for good measure:

Comparing Inception Scores for all of the following models:

Conclusions: